
refactor(backend): comprehensive backend optimization — GPU management, security, testing, and code quality#9

Merged
sylvanding merged 21 commits into main from refactor/backend-comprehensive-optimization
Mar 18, 2026

Conversation

@sylvanding (Owner)

Summary

This PR delivers a comprehensive backend optimization across 21 commits and 136 changed files (+14,731 / -811 lines), covering the following major areas:

🔧 Core Improvements

  • Async blocking fixes: Wrapped synchronous I/O calls (socket.getaddrinfo, subprocess.wait/read, fitz.open) with asyncio.to_thread() to prevent event loop blocking
  • Double commit elimination: Removed redundant db.commit() calls in services whose sessions are already committed by their callers
  • Exception swallowing fixes: Replaced bare except: pass with proper logging and re-raising
  • Config unification: Centralized environment variable management with Pydantic v2 Settings
  • Prompt centralization: Extracted all hardcoded LLM prompts into app/prompts/ module
  • RAG retrieval optimization: Improved hybrid retrieval with reranker integration
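The async-blocking fix above can be sketched as follows. This is a minimal illustration of the pattern, not the PR's actual helper (`resolve_host` is a hypothetical name); the same `asyncio.to_thread()` wrapping applies to `subprocess.wait` and `fitz.open` in the real services.

```python
import asyncio
import socket

async def resolve_host(host: str, port: int) -> list:
    # socket.getaddrinfo is synchronous; running it in a worker thread
    # keeps the event loop free to serve other requests meanwhile.
    return await asyncio.to_thread(socket.getaddrinfo, host, port)

async def main() -> None:
    infos = await resolve_host("localhost", 80)
    assert infos, "expected at least one address for localhost"

asyncio.run(main())
```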

⚡ GPU Resource Management (New Feature)

  • TTL-based auto-unload: GPU models (embedding, reranker, PaddleOCR) are automatically unloaded after configurable idle time (MODEL_TTL_SECONDS)
  • GPU_MODE presets: conservative / balanced / aggressive presets for batch sizes and parallelism
  • GPU monitoring API: GET /gpu/status and POST /gpu/unload endpoints
  • MinerU subprocess auto-management: Auto start/stop MinerU with TTL, conda env isolation
  • Exit cleanup: atexit + SIGHUP handlers ensure GPU resources are released on program exit
  • External watchdog: scripts/gpu_watchdog.py daemon monitors process health and cleans up after crashes
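The TTL mechanism can be sketched with a minimal manager: load on first access, stamp the last-use time, and drop the model once it has sat idle past the TTL. This is an illustrative skeleton (class and method names are assumptions, and real code would also free GPU memory on unload), not the PR's `GPUModelManager` implementation.

```python
import threading
import time
from typing import Any, Callable

class TTLModelManager:
    """Minimal sketch of TTL-based model unloading."""

    def __init__(self, loader: Callable[[], Any], ttl_seconds: float) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._model: Any = None
        self._last_used = 0.0
        self._lock = threading.Lock()

    def get(self) -> Any:
        with self._lock:
            if self._model is None:
                self._model = self._loader()  # lazy load on first use
            self._last_used = time.monotonic()
            return self._model

    def maybe_unload(self) -> bool:
        """Unload if idle longer than the TTL; returns True if unloaded."""
        with self._lock:
            if self._model is not None and time.monotonic() - self._last_used > self._ttl:
                self._model = None  # real code would also clear the CUDA cache here
                return True
            return False
```

A periodic task (or the external watchdog) would call `maybe_unload()` on each managed model.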

🔒 Security Enhancements

  • SSRF protection: url_validator.py blocks requests to private/reserved IPs in RSS feeds and crawler
  • Project ID validation: Added Depends(get_project) to RAG, subscription, and search endpoints
  • Input validation: Literal type constraints for strategy/priority params, Pydantic request bodies for search
  • Rate limiting: Applied slowapi to writing stream endpoint
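The SSRF check in `url_validator.py` presumably follows the standard shape: resolve the host, then reject any address in a private, loopback, link-local, or reserved range. A minimal sketch under that assumption (`is_url_safe` is an illustrative name, not necessarily the module's actual API):

```python
import ipaddress
import socket
from urllib.parse import urlparse

def is_url_safe(url: str) -> bool:
    """Reject URLs whose host resolves to a non-public address."""
    host = urlparse(url).hostname
    if not host:
        return False
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return False  # unresolvable hosts are treated as unsafe
    for info in infos:
        addr = ipaddress.ip_address(info[4][0])
        if addr.is_private or addr.is_loopback or addr.is_link_local or addr.is_reserved:
            return False
    return True
```

Note that a robust validator must also pin the resolved address when making the actual request, otherwise a DNS rebinding attack can pass the check and still reach an internal host.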

🗄️ Data Integrity

  • Paper unique constraint: UniqueConstraint("project_id", "doi") prevents duplicate papers at DB level
  • Pipeline persistence: Migrated from MemorySaver to AsyncSqliteSaver for checkpoint durability
  • Composite indexes: Added indexes on (project_id, status) and keyword.parent_id
  • Pipeline cancellation: Extracted to pipelines/cancellation.py module, removing reverse dependency
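At the SQL level, the paper unique constraint amounts to the following. This sketch uses stdlib sqlite3 with an illustrative minimal schema; the PR itself declares it via SQLAlchemy's `UniqueConstraint("project_id", "doi")` plus an Alembic migration.

```python
import sqlite3

def create_schema(conn: sqlite3.Connection) -> None:
    # Minimal stand-in for the Paper table: the same DOI may appear in
    # different projects, but only once within a project.
    conn.execute(
        """
        CREATE TABLE papers (
            id INTEGER PRIMARY KEY,
            project_id INTEGER NOT NULL,
            doi TEXT,
            UNIQUE (project_id, doi)
        )
        """
    )
```

Inserting the same `(project_id, doi)` pair twice raises `sqlite3.IntegrityError`, so duplicates are rejected even if application-level dedup misses them.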

🐛 Bug Fixes

  • Pipeline data loss: Fixed missing new_paper field in ResolvedConflict schema that caused keep_new action to lose data
  • Resource leak: Fixed unclosed fitz.open() file handle in OCR service
  • LLM config fallback: Fixed temperature/max_tokens not respecting user-defined settings
  • Settings test_connection: Returns proper HTTP error status codes instead of 200 with error body
  • MinerU conda: Removed unsupported --no-banner flag from conda run command

🧪 Testing (178 → 526 tests)

  • 141 new API tests: Comprehensive coverage for all REST endpoints
  • E2E live server tests: 25 tests with real LLM integration
  • Unit tests: GPU model manager, MinerU process manager, reranker, URL validator, PDF metadata, chat pipeline
  • Stress tests: Concurrent request handling and pipeline orchestration

📝 Code Quality

  • Unified pagination: PaginationParams / KeywordPaginationParams Pydantic models
  • SSE error format: Consistent event: error\ndata: {"code": ..., "message": ...} across all streaming endpoints
  • OpenAPI tags: Proper endpoint grouping and documentation
  • Lambda removal: Replaced lambda model loaders with named functions for debuggability
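The unified SSE error frame can be produced by a small helper along these lines (a sketch consistent with the format described above; the actual `format_sse_error` signature in the codebase may differ):

```python
import json

def format_sse_error(code: str, message: str) -> str:
    """Emit one SSE frame: an `error` event with a JSON data payload."""
    payload = json.dumps({"code": code, "message": message})
    # A blank line terminates the frame per the SSE wire format.
    return f"event: error\ndata: {payload}\n\n"
```

Clients can then register a single `error` event listener instead of parsing ad-hoc error strings per endpoint.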

📖 Documentation

  • Updated README (EN/ZH): New startup instructions (Alembic migration, GPU watchdog, MinerU setup)
  • Updated .env.example: All new config options documented
  • API endpoint catalog: docs/api-endpoints.md
  • Brainstorms & plans: 10+ design documents in docs/brainstorms/ and docs/plans/

Test Plan

  • All 526 backend tests pass (pytest tests/ -v)
  • The 2 skipped tests are intentional (GPU-dependent)
  • ruff check and ruff format clean
  • Pre-commit hooks pass on all commits
  • Database migration scripts verified (alembic upgrade head)
  • MinerU auto-management tested with conda run
  • GPU cleanup verified on process exit (atexit, SIGHUP)
  • Watchdog script tested with daemon mode

sylvanding and others added 21 commits on March 17, 2026
…swallowing

- Wrap feedparser.parse, fitz _extract_local, and ChromaDB sync calls
  with asyncio.to_thread to avoid blocking the event loop
- Add count cache to RAGService to reduce redundant ChromaDB count()
  calls within a single request
- Remove manual db.commit() from conversations CRUD and persist_node;
  rely on get_session() auto-commit to prevent double commits
- Replace bare except-pass in rag_service with debug logging
- Upgrade MCP mount failure log from warning to error with traceback

Made-with: Cursor
… retrieval

- Sync config.py defaults with actual Qwen3 models (Embedding-0.6B, Reranker-0.6B-seq-cls)
- Centralize all LLM/VLM prompts into app/prompts/ module (chat, completion, dedup, keyword, rag, rewrite, writing)
- Add reranker service with singleton loading, semaphore concurrency control, and graceful fallback
- Implement batch adjacent chunk fetching to eliminate N+1 ChromaDB queries
- Enable MMR diversity via vector_store_query_mode with configurable threshold
- Tune HNSW index parameters (ef_construction=200, M=32, ef_search=100)
- Expose rag_top_k and use_reranker in Chat API with input validation
- Extract generic get_or_404 helper using PEP 695 type parameters
- Add rate limit, auth middleware, and API endpoint hardening

Made-with: Cursor
…cases

- Add 4 new test modules covering projects, papers, keywords, search, dedup,
  chat, RAG, writing, conversations, subscriptions, tasks, and settings APIs
- Support real_llm marker for Volcengine-dependent tests (2 tests)
- Verify SSE streaming events (start, text-delta, finish, [DONE])
- Test new reranker and RAG parameter exposure in Chat/RAG endpoints
- All 370 tests pass (2 skipped for real_llm when provider not configured)

Made-with: Cursor
…arch

- Document all 76 backend API endpoints with parameters and flags
- Add brainstorm docs for backend review and config/RAG/testing sessions
- Add implementation plans with acceptance criteria and research insights
- Include RAG retrieval optimization best practices research

Made-with: Cursor
…skipped)

Full end-to-end test suite against a live backend with Volcengine LLM:
- PDF upload and background processing (pdfplumber fallback)
- RAG index build, stats, and query with real LLM answers
- SSE streaming chat (basic + RAG-enhanced)
- Writing assistant (summarize, citations, review outline, gap analysis)
- Conversation persistence and settings APIs
- Auto-skips when server is unreachable

Made-with: Cursor
…ensive E2E tests

- Add ocr_parallel_limit config for controlling concurrent OCR tasks
- Refactor paper_processor.py from serial to parallel OCR with asyncio.gather,
  semaphore-based concurrency control, and round-robin GPU assignment
- Support CPU-only, single-GPU, and multi-GPU environments gracefully
- Add MinerU client unit tests (mocked HTTP) and E2E integration tests
- Add stress tests: 8-PDF concurrent upload, concurrent RAG queries, concurrent chat streams
- Add quality comparison tests: MinerU vs pdfplumber extraction metrics
- Add GPU utilization monitoring via nvidia-smi sampling during stress tests
- Enhance existing E2E tests with MinerU parsing verification
- Add MinerU deployment guide (docs/solutions/deployment/mineru-setup-guide.md)
- Add OCR_PARALLEL_LIMIT to .env.example

Test results: 394 unit/integration passed, 37 E2E passed (across 4 test suites)

Made-with: Cursor
- Add huggingface-hub as explicit dependency in pyproject.toml
  (was missing, causing RAG index build to fail with ImportError)
- Add GET /papers/{paper_id}/chunks API endpoint with ChunkRead schema
  (test_paper_chunks_have_sections was skipped because endpoint didn't exist)
- Implement smart GPU selection: _pick_best_gpu() chooses the device
  with the most free memory instead of always using cuda:0
- Add CUDA OOM auto-retry in RAG index build endpoint: clears GPU cache,
  reloads embedding model onto best available GPU, and retries
- Reduce embedding batch_size from 32 to 8 to lower peak GPU memory
- Reuse detect_gpu() in reranker_service for consistent GPU selection
- Add _cleanup_gpu_memory() (gc.collect + empty_cache) before model loads
- Add retry logic for flaky LLM responses in test_rag_query_with_real_llm
- Update test assertions for new cuda:N device string format

Results: 28/29 E2E tests pass (previously 27/29 with 2 skipped + failures)
Made-with: Cursor
Three presets (conservative/balanced/aggressive) control batch sizes,
parallelism, and GPU pinning across embedding, reranker, and OCR services.
Users can override any parameter individually via .env. Default mode is
balanced for backward compatibility; .env set to conservative for current
debugging phase with CUDA_VISIBLE_DEVICES=6,7.

Made-with: Cursor
…ts across 5 phases

Phase 1 (P0 Critical):
- Fix OCR blocking event loop with asyncio.to_thread()
- Implement pipeline cancellation with shared state + asyncio.Task.cancel()
- Add SSRF prevention (url_validator.py) + DOI format validation
- Save asyncio.create_task references to prevent GC

Phase 2 (API Consistency):
- Unify error responses: HTTPException + ValidationError → ApiResponse format
- Strengthen Schema validation: Literal types, max_length, ge/le constraints
- Fix non-serializable ValueError in validation error handler

Phase 3 (API Completion):
- Persist pipeline state to Task table
- Add pipeline list endpoint + typed ResumeRequest
- Add batch delete papers endpoint
- Add composite indexes (paper/task project+status) + Alembic migration

Phase 4 (MCP & Middleware):
- Add 4 MCP tools: summarize_papers, generate_review_outline, analyze_gaps, manage_keywords
- Add MCP input validation (top_k, max_results bounds)
- Add per-endpoint rate limiting (chat 30/min, OCR 5/min, RAG 5/min, pipeline 10/min)
- Add subscription auto_import parameter
- Remove llm_client.py shim, unify LLM imports
- Expand Schema __init__.py exports

Phase 5 (WebSocket & Polish):
- Add WebSocket ConnectionManager with room-based broadcasts
- Add pipeline WebSocket endpoint for real-time status
- Add /health endpoint
- Improve CORS config (expose_headers, max_age)
- Restrict API key to header-only (no query params)
- Add project export/import endpoints
- Disable rate limiting in test environment
- Add 33 new tests (url_validator, middleware, batch delete, export/import, WS manager, schema validation)
- Fix existing tests for new error format and Literal constraints

409 tests passing, ruff clean.
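The room-based `ConnectionManager` from Phase 5 likely reduces to grouping sockets by a room key and fanning messages out to that group. A minimal sketch under that assumption (method names are illustrative; the real class manages FastAPI `WebSocket` objects):

```python
import asyncio
from collections import defaultdict

class ConnectionManager:
    """Group WebSocket connections by room and broadcast per room."""

    def __init__(self) -> None:
        self._rooms: dict[str, set] = defaultdict(set)

    def connect(self, room: str, ws) -> None:
        self._rooms[room].add(ws)

    def disconnect(self, room: str, ws) -> None:
        self._rooms[room].discard(ws)

    async def broadcast(self, room: str, message: str) -> None:
        # Copy the set so a disconnect during iteration is safe.
        for ws in list(self._rooms.get(room, ())):
            await ws.send_text(message)
```

A pipeline status endpoint would then broadcast to a room keyed by pipeline ID, so only clients watching that pipeline receive its updates.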

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
Made-with: Cursor
…g gaps fix

- Extract hardcoded constants to config.py (S2 API, rewrite timeout,
  title similarity threshold, app version)
- Unify citation_graph_service error handling to use HTTPException
  instead of returning 200 with error dict
- Narrow rewrite.py exception handling from broad Exception to specific types
- Use Path.is_relative_to() for safer path validation in pipelines
- Add LLMConfigResolver unit tests (12 tests covering from_env/from_merged)
- Add RerankerService unit tests (7 tests covering caching and fallback)
- Add MCP tool tests for all 7 previously untested tools (20 new tests)
- Add Pipeline real PDF integration tests with HITL flow
- Add Chat tool_mode tests for citation_lookup, review_outline, gap_analysis

Total: 498 tests passing (up from ~409)
Made-with: Cursor
…ocess control

- Add GPUModelManager with TTL-based auto-unloading (default 5min idle)
- Add MinerUProcessManager for auto start/stop of MinerU subprocess
- Refactor embedding_service and reranker_service to use GPUModelManager
- Add OCRService.close() and context manager for explicit GPU cleanup
- Add GPU monitoring API: GET /api/v1/gpu/status, POST /api/v1/gpu/unload
- Integrate managers into FastAPI lifespan (startup/shutdown)
- Add config fields: model_ttl_seconds, mineru_auto_manage, mineru_ttl_seconds
- Add 30 new tests (GPUModelManager, MinerUProcessManager, GPU API)

Models are loaded on-demand and released after idle timeout to minimize
GPU memory usage when the system is not actively processing requests.

Made-with: Cursor
The current conda version does not support the --no-banner argument,
causing MinerU auto-start to silently fail and fall back to pdfplumber.

Made-with: Cursor
…rity, resource leaks

- Fix ResolvedConflict missing new_paper field causing keep_new data loss
- Add merge action support in apply_resolution_node
- Extract pipeline cancellation to shared module, fix memory leak
- Wrap blocking socket.getaddrinfo/process.wait in asyncio.to_thread
- Fix fitz.open resource leak with context manager
- Add SSRF validation for subscription feed URLs
- Add project existence checks for rag/subscription/search endpoints

Made-with: Cursor
…, input validation

Phase 2: Data integrity + Pipeline persistence
- Add Paper (project_id, doi) unique constraint with Alembic migration
- Replace MemorySaver with AsyncSqliteSaver for pipeline checkpointing
- Add pipeline_checkpoint_db config field

Phase 3: Code quality refactoring
- Extract GPU memory cleanup to shared gpu_utils.py
- Unify OCR calls to use process_pdf_async (MinerU priority)
- Fix LLM config resolver temperature/max_tokens fallback
- Fix hardcoded /tmp path in OCR service
- Replace lambda with explicit helper functions in embedding_service
- Add engine.dispose() on application shutdown

Phase 4: Input validation + API consistency
- Add unified PaginationParams for all list endpoints
- Add Literal type constraints for dedup strategy and crawler priority
- Add SearchExecuteRequest Pydantic model for search API
- Add typed Pydantic models for project import data

Made-with: Cursor
- Add 6 unit tests for pdf_metadata service (normal/corrupted/no-doi/crossref)
- Extend paper API tests with chunks and 404 coverage
- Add shared fixtures to conftest.py for new tests

Made-with: Cursor
…imits, indexes

- Add summary to all API endpoints for OpenAPI documentation
- Unify SSE error format with format_sse_error helper
- Add rate limiting to writing stream endpoint
- Extract citation error messages to constants
- Add reranker top_n/batch_size documentation
- Add Keyword parent_id index with Alembic migration
- Update frontend subscription API for pagination compatibility

Made-with: Cursor
…and gpu_utils

- Disable paper DOI unique constraint in dedup test fixtures
- Update search tests to use JSON body instead of query params
- Fix pipeline tests for task cleanup timing
- Update gpu_model_manager tests to mock gpu_utils.gc
- Mock validate_url_safe in subscription tests for SSRF bypass

Made-with: Cursor
Two-layer safety net for GPU cleanup on all exit scenarios:

Layer 1 — In-process safety net:
- atexit handler for sync cleanup (GPU unload + MinerU kill)
- SIGHUP handler for terminal close
- Enhanced MinerU stop() kills external processes by port lookup
- PID file for watchdog coordination

Layer 2 — External watchdog script:
- Independent process monitors Omelette via PID file
- Cleans up GPU resources after any exit (including kill -9, OOM)
- Supports daemon mode for background operation

Covers: Ctrl+C, kill, kill -9, OOM/crash, terminal close
Made-with: Cursor
…atures

Add GPU TTL, MinerU auto-management, watchdog, and Alembic migration
instructions to both EN/ZH README files. Sync .env.example with new
config options introduced in this branch.

Made-with: Cursor
@sylvanding sylvanding merged commit 0c46ba6 into main Mar 18, 2026
4 checks passed
